Tagging the Past: Experiments using the Saga Corpus

Author

  • Hrafn Loftsson
Abstract

There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagging Old Icelandic. Second, we semi-automatically correct errors in the training corpus using a bootstrapping method. Finally, we evaluate the taggers on the corrected training corpus. The best performing single tagger is Stagger, a tagger based on the averaged perceptron algorithm, obtaining an accuracy of 91.76%. By combining the output of three taggers, using a simple voting scheme, the accuracy increases to 92.32%.
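
The "simple voting scheme" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how ties are resolved, so the fallback to one designated tagger (here the first one, standing in for the best single tagger) is an assumption, as are the function names and the toy tags used below.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Pick the most frequent tag among the taggers' proposals for one token.

    Assumption: when no single tag has the highest count (for example, all
    three taggers disagree), the tag from the tagger at `fallback_index`
    is used.
    """
    counts = Counter(tags)
    top = max(counts.values())
    winners = [tag for tag, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    return tags[fallback_index]


def combine(outputs, fallback_index=0):
    """Combine per-tagger tag sequences (all of equal length) token by token."""
    return [vote(list(column), fallback_index) for column in zip(*outputs)]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical tagger outputs for a three-token sentence (toy tags).
tagger_a = ["N", "V", "ADJ"]
tagger_b = ["N", "ADV", "N"]
tagger_c = ["P", "V", "V"]
combined = combine([tagger_a, tagger_b, tagger_c])
```

In this toy input the first two tokens are decided by a two-to-one majority, and the third, where all taggers disagree, falls back to the first tagger's tag.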


Similar resources

Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part-of-speech tagging (POS tagging) is an ongoing area of research in natural language processing (NLP). The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...


Blending Segmentation With Tagging In Chinese Language Corpus Processing

This paper proposes a new method for Chinese language corpus processing. Unlike past research, our approach has the following characteristics: it blends segmentation with tagging and integrates a rule-based approach with a statistics-based one in grammatical disambiguation. The principal ideas presented in the paper are incorporated in the development of a Chinese corpus processing system. Exp...


Estimation of Polychlorinated Biphenyls Intake through Fish Oil-Derived Dietary Supplements and Prescription Drugs in the Japanese Population

Background: Oily fish and their extracted oils may be a source of polychlorinated biphenyls (PCBs), which can induce toxic effects in consumers. The main aim of this survey was to estimate PCB intake through fish oil-derived dietary supplements and prescription drugs in the Japanese population. Methods: PCB levels were determined in 20 fish oil-derived dietary supplements and 6 oil-deri...


A hidden Markov model for Persian part-of-speech tagging

Part-of-speech tagging is one of the important tasks in language processing. Despite this importance, and although numerous models have been presented for different languages, little work has been done on Persian. In this paper, a part-of-speech tagging system for a Persian corpus using a hidden Markov model is proposed. To achieve this goal, the main aspects of Pers...


Morphological Segmentation and Part of Speech Tagging for Religious Arabic

We annotate a small corpus of religious Arabic with morphological segmentation boundaries and fine-grained segment-based part of speech tags. Experiments on both segmentation and POS tagging show that the religious corpus-trained segmenter and POS tagger outperform the Arabic Treebank-trained ones although the latter is 21 times as big, which shows the need for building religious Arabic linguis...




Publication date: 2013